Loading network data

CSV -> List of Dictionaries -> igraph

sand's underlying graph implementation is igraph. igraph offers several ways to load data, but sand provides a few convenience functions that simplify the workflow:



In [1]:

    
import sand

Read network data from csv with `csv_to_dicts`

csv_to_dicts reads a CSV into a list of Python dictionaries. Each column in the CSV becomes a corresponding key in each dictionary.

Let's load a CSV of function dependencies in a Clojure library, taken from lein-topology, into a list of dictionaries:



In [2]:

    
edgelist_file = './data/lein-topology-57af741.csv'
edgelist_data = sand.csv_to_dicts(edgelist_file, header=['source', 'target', 'weight'])
edgelist_data[:5]

    Out[2]:





[OrderedDict([('source', 'topology.dependencies/dependencies'),
              ('target', 'clojure.core/defn-'),
              ('weight', '1')]),
 OrderedDict([('source',
               'topology.edgelist-test/syntax-quotes-add-seq-concat-list'),
              ('target', 'clojure.core/filter'),
              ('weight', '1')]),
 OrderedDict([('source',
               'topology.dependencies-test/should-compute-fn-calls-in-namespace'),
              ('target', 'clojure.core/defn'),
              ('weight', '1')]),
 OrderedDict([('source', 'example/test-when'),
              ('target', 'clojure.core/cons'),
              ('weight', '1')]),
 OrderedDict([('source', 'leiningen.topology/topology'),
              ('target', 'org.clojure/clojure'),
              ('weight', '1')])]
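
Note that every value in the output above is a string, including the weights. As a rough sketch of what a helper like this might do (this is not sand's actual implementation; `csv_to_dicts_sketch` and the inline sample rows are made up for illustration), the standard library's `csv.DictReader` gives the same column-to-key mapping:

```python
import csv
import io


def csv_to_dicts_sketch(lines, header=None):
    """Read CSV rows into a list of dicts (hypothetical stand-in, not sand's code).

    If `header` is given it supplies the keys; otherwise the first row does.
    """
    reader = csv.DictReader(lines, fieldnames=header)
    return [dict(row) for row in reader]


# Two headerless rows shaped like the edge list above (sample data only).
sample = io.StringIO(
    "topology.dependencies/dependencies,clojure.core/defn-,1\n"
    "example/test-when,clojure.core/cons,1\n"
)
rows = csv_to_dicts_sketch(sample, header=["source", "target", "weight"])

# Every value is read as a string, so convert weights explicitly if needed.
weights = [int(row["weight"]) for row in rows]
print(rows[0]["target"], weights)
```

If numeric weights matter for later analysis, converting them once right after loading keeps the rest of the workflow simple.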

Use `from_edges` with an edge list consisting of two vertex names and an edge weight



In [3]:

    
functions = sand.from_edges(edgelist_data)
functions.summary()

    Out[3]:





'IGRAPH DNW- 107 206 -- \n+ attr: group (v), indegree (v), label (v), name (v), outdegree (v), weight (e)'
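
In igraph's summary header, `D` means directed, `N` means the vertices are named, and `W` means the edges are weighted; the two numbers are the vertex and edge counts. Since no vertex list was passed in, the 107 vertices must come from the names that appear in the edge list. A minimal sketch of that derivation, assuming nothing about sand's internals (`vertices_from_edges` is a hypothetical helper, and the third sample row is invented to show a repeated vertex):

```python
def vertices_from_edges(edges):
    """Collect distinct vertex names from edge dicts in first-seen order."""
    seen = {}
    for edge in edges:
        for key in ("source", "target"):
            seen.setdefault(edge[key], len(seen))
    return list(seen)


# Sample rows only; the last one is made up for illustration.
edges = [
    {"source": "topology.dependencies/dependencies", "target": "clojure.core/defn-", "weight": "1"},
    {"source": "example/test-when", "target": "clojure.core/cons", "weight": "1"},
    {"source": "example/test-when", "target": "clojure.core/defn-", "weight": "1"},
]
names = vertices_from_edges(edges)
print(len(names), len(edges))  # 4 3
```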

... or use `from_vertices_and_edges` with two lists of dictionaries

A richer network model includes attributes on the vertex and edge collections, including unique identifiers for each vertex.

We can use Jupyter's cell magic to generate some sample data. Here we'll represent a network of students reviewing one another's work. Students (vertices) will be in people.csv and reviews (edges) will be in reviews.csv:



In [4]:

    
people_file = './data/people.csv'



In [5]:

    
%%writefile $people_file
uuid,name,cohort
6aacd73c-0be5-412d-95a3-ca54149c9952,Mark Taylor,Day 1 - Period 6
5205741f-3ea9-4c30-9c50-4bab229a51ce,Aidin Aslani,Day 1 - Period 6
14a36491-5a3d-42c9-b012-6a53654d9bac,Charlie Brown,Day 1 - Period 2
9dc7633a-e493-4ec0-a252-8616f2148705,Armin Norton,Day 1 - Period 2

Overwriting ./data/people.csv



In [6]:

    
review_file = './data/reviews.csv'



In [7]:

    
%%writefile $review_file
reviewer_uuid,student_uuid,feedback,date,weight
6aacd73c-0be5-412d-95a3-ca54149c9952,14a36491-5a3d-42c9-b012-6a53654d9bac,Awesome work!,2015-02-12,1
5205741f-3ea9-4c30-9c50-4bab229a51ce,9dc7633a-e493-4ec0-a252-8616f2148705,WOW!,2014-02-12,1

Overwriting ./data/reviews.csv
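
The `%%writefile` magic is only available inside Jupyter/IPython. Outside a notebook, the same sample file could be produced with the standard `csv` module. The following is a sketch under that assumption; `write_dicts_to_csv` is a hypothetical helper, and a temporary directory stands in for `./data`:

```python
import csv
import os
import tempfile

# Two of the sample students from people.csv above.
people_rows = [
    {"uuid": "6aacd73c-0be5-412d-95a3-ca54149c9952", "name": "Mark Taylor", "cohort": "Day 1 - Period 6"},
    {"uuid": "14a36491-5a3d-42c9-b012-6a53654d9bac", "name": "Charlie Brown", "cohort": "Day 1 - Period 2"},
]


def write_dicts_to_csv(path, rows):
    """Write a list of dicts to `path`, using the first row's keys as the header."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


# A temporary directory keeps the sketch free of side effects.
with tempfile.TemporaryDirectory() as tmp_dir:
    people_path = os.path.join(tmp_dir, "people.csv")
    write_dicts_to_csv(people_path, people_rows)
    with open(people_path, newline="") as handle:
        written = handle.read().splitlines()

print(written[0])  # uuid,name,cohort
```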

We again load this data into lists of dictionaries with `csv_to_dicts`:



In [8]:

    
people_data = sand.csv_to_dicts(people_file)
people_data

    Out[8]:





[OrderedDict([('uuid', '6aacd73c-0be5-412d-95a3-ca54149c9952'),
              ('name', 'Mark Taylor'),
              ('cohort', 'Day 1 - Period 6')]),
 OrderedDict([('uuid', '5205741f-3ea9-4c30-9c50-4bab229a51ce'),
              ('name', 'Aidin Aslani'),
              ('cohort', 'Day 1 - Period 6')]),
 OrderedDict([('uuid', '14a36491-5a3d-42c9-b012-6a53654d9bac'),
              ('name', 'Charlie Brown'),
              ('cohort', 'Day 1 - Period 2')]),
 OrderedDict([('uuid', '9dc7633a-e493-4ec0-a252-8616f2148705'),
              ('name', 'Armin Norton'),
              ('cohort', 'Day 1 - Period 2')])]



In [9]:

    
review_data = sand.csv_to_dicts(review_file)
review_data

    Out[9]:





[OrderedDict([('reviewer_uuid', '6aacd73c-0be5-412d-95a3-ca54149c9952'),
              ('student_uuid', '14a36491-5a3d-42c9-b012-6a53654d9bac'),
              ('feedback', 'Awesome work!'),
              ('date', '2015-02-12'),
              ('weight', '1')]),
 OrderedDict([('reviewer_uuid', '5205741f-3ea9-4c30-9c50-4bab229a51ce'),
              ('student_uuid', '9dc7633a-e493-4ec0-a252-8616f2148705'),
              ('feedback', 'WOW!'),
              ('date', '2014-02-12'),
              ('weight', '1')])]



In [10]:

    
reviews = sand.from_vertices_and_edges(
    vertices=people_data,
    edges=review_data,
    vertex_name_key='name',
    vertex_id_key='uuid',
    edge_foreign_keys=('reviewer_uuid', 'student_uuid'))
reviews.summary()

    Out[10]:





'IGRAPH DNW- 4 2 -- \n+ attr: cohort (v), group (v), indegree (v), label (v), name (v), outdegree (v), uuid (v), date (e), feedback (e), reviewer_uuid (e), student_uuid (e), weight (e)'
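
The `vertex_id_key` and `edge_foreign_keys` arguments tie the two lists together: each edge row refers to its endpoints by uuid. One way to picture that lookup, as a sketch rather than sand's actual code (`resolve_edges` is a hypothetical helper, and the rows are trimmed to the columns that matter here):

```python
def resolve_edges(vertices, edges, vertex_id_key, edge_foreign_keys):
    """Turn foreign-key edges into (source_index, target_index) pairs."""
    index_of = {v[vertex_id_key]: i for i, v in enumerate(vertices)}
    source_key, target_key = edge_foreign_keys
    return [(index_of[e[source_key]], index_of[e[target_key]]) for e in edges]


people = [
    {"uuid": "6aacd73c-0be5-412d-95a3-ca54149c9952", "name": "Mark Taylor"},
    {"uuid": "5205741f-3ea9-4c30-9c50-4bab229a51ce", "name": "Aidin Aslani"},
    {"uuid": "14a36491-5a3d-42c9-b012-6a53654d9bac", "name": "Charlie Brown"},
    {"uuid": "9dc7633a-e493-4ec0-a252-8616f2148705", "name": "Armin Norton"},
]
review_rows = [
    {"reviewer_uuid": "6aacd73c-0be5-412d-95a3-ca54149c9952",
     "student_uuid": "14a36491-5a3d-42c9-b012-6a53654d9bac"},
    {"reviewer_uuid": "5205741f-3ea9-4c30-9c50-4bab229a51ce",
     "student_uuid": "9dc7633a-e493-4ec0-a252-8616f2148705"},
]

pairs = resolve_edges(people, review_rows, "uuid", ("reviewer_uuid", "student_uuid"))
print(pairs)  # [(0, 2), (1, 3)]
```

An edge that names a uuid missing from the vertex list would raise a `KeyError` in this sketch, which is a reasonable way to surface inconsistent input early.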

Several vertex attributes are automatically computed when the graph is loaded:



In [11]:

    
reviews.vs['indegree']

    Out[11]:





[0, 0, 1, 1]



In [12]:

    
reviews.vs['outdegree']

    Out[12]:





[1, 1, 0, 0]
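
These counts follow directly from the two reviews: the reviewer is the source of each edge and the reviewed student is the target. igraph itself exposes them through `Graph.indegree()` and `Graph.outdegree()`. As a plain-Python illustration (`degree_lists` is a hypothetical helper), with the reviews written as vertex-index pairs, Mark Taylor to Charlie Brown is `(0, 2)` and Aidin Aslani to Armin Norton is `(1, 3)`:

```python
from collections import Counter


def degree_lists(vertex_count, pairs):
    """Count incoming and outgoing edges per vertex index."""
    out_counts = Counter(src for src, _ in pairs)
    in_counts = Counter(dst for _, dst in pairs)
    indegree = [in_counts[i] for i in range(vertex_count)]
    outdegree = [out_counts[i] for i in range(vertex_count)]
    return indegree, outdegree


pairs = [(0, 2), (1, 3)]
indegree, outdegree = degree_lists(4, pairs)
print(indegree)   # [0, 0, 1, 1]
print(outdegree)  # [1, 1, 0, 0]
```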



In [13]:

    
reviews.vs['label']

    Out[13]:





['Mark Taylor', 'Aidin Aslani', 'Charlie Brown', 'Armin Norton']



In [14]:

    
reviews.vs['name']

    Out[14]:





['Mark Taylor', 'Aidin Aslani', 'Charlie Brown', 'Armin Norton']

Groups

Groups represent modules or communities in the network. By default, groups are derived from the vertex labels, so each distinct label gets its own group.



In [15]:

    
reviews.vs['group']

    Out[15]:





[2, 0, 3, 1]
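
With four distinct labels there are four distinct group ids. A sketch of label-based grouping, assuming only that equal labels share an id (`labels_to_groups` is a hypothetical helper; the particular numbering it produces differs from sand's output above, which is fine because the ids are arbitrary):

```python
def labels_to_groups(labels):
    """Assign one integer group id per distinct label (illustrative only)."""
    group_of = {}
    for label in labels:
        group_of.setdefault(label, len(group_of))
    return [group_of[label] for label in labels]


labels = ["Mark Taylor", "Aidin Aslani", "Charlie Brown", "Armin Norton"]
groups = labels_to_groups(labels)
print(groups)            # [0, 1, 2, 3]
print(len(set(groups)))  # 4
```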

The vertex names in the lein-topology data set are fully qualified function names, so grouping by name isn't particularly useful here. Every vertex ends up in its own group:



In [16]:

    
len(set(functions.vs['group']))

    Out[16]:





107



In [17]:

    
len(functions.vs)

    Out[17]:





107

Because sand was built specifically for analyzing software and system networks, an `fqn_to_groups` grouping function is built in:



In [18]:

    
functions.vs['group'] = sand.fqn_to_groups(functions.vs['label'])
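
The rules inside `fqn_to_groups` are not shown here. A plausible sketch, assuming it groups each function by the namespace portion of its fully qualified name (the text before the first `/`), could look like this; `namespace_of` and `fqn_to_groups_sketch` are hypothetical names, not sand's API:

```python
def namespace_of(fqn):
    """Return the text before the first '/', or the whole name if there is none."""
    return fqn.partition("/")[0]


def fqn_to_groups_sketch(labels):
    """Give every label that shares a namespace the same integer group id."""
    group_of = {}
    groups = []
    for label in labels:
        ns = namespace_of(label)
        groups.append(group_of.setdefault(ns, len(group_of)))
    return groups


# Labels taken from the edge list sample shown earlier.
labels = [
    "topology.dependencies/dependencies",
    "clojure.core/defn-",
    "clojure.core/filter",
    "example/test-when",
    "clojure.core/cons",
]
print(fqn_to_groups_sketch(labels))  # [0, 1, 1, 2, 1]
```

Splitting on the first slash rather than the last keeps names such as Clojure's division function, `clojure.core//`, in the `clojure.core` namespace.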



In [19]:

    
len(set(functions.vs['group']))

    Out[19]:





20

This is a much more manageable number of groups. We'll see one way that these groups are useful when we render a visualization of the network: